Fix qwen encoder hidden states mask #12655

cdutr · 2025-11-14T02:06:16Z

What does this PR do?

Fixes the QwenImage encoder to properly apply encoder_hidden_states_mask when passed to the model. Previously, the mask parameter was accepted but ignored, causing padding tokens to incorrectly influence attention computation.

Changes

Attention mask application: Modified QwenDoubleStreamAttnProcessor2_0 to create a 2D attention mask from the 1D encoder_hidden_states_mask, properly masking text padding tokens while keeping all image tokens unmasked
RoPE adjustment: Updated positional embedding computation to use the full padded sequence length when a mask is present, ensuring correct position indices
Tests: Added comprehensive tests validating that:
- Padding tokens are properly isolated and don't affect outputs
- Masked outputs differ significantly from unmasked outputs
Benchmarks: Included performance analysis showing acceptable overhead (<20% for inference, ~19% for training scenarios)

Impact

This fix enables proper Classifier-Free Guidance (CFG) batching with variable-length text sequences, which is common when batching conditional and unconditional prompts together.

Benchmark Results

Scenario	Latency (ms)	Peak Memory (MB)	Throughput (iter/s)
Baseline (no mask)	11.68 ± 0.23	301.5	84.70
Mask all-ones (no padding)	12.01 ± 0.26	301.5	82.34
Mask with padding (CFG)	13.86 ± 0.24	301.5	71.42

Overhead: +2.8% for mask processing without padding, +18.7% with actual padding (realistic CFG scenario)

The higher overhead with padding is expected and acceptable as it represents the cost of properly handling variable-length sequences in batched inference. This is a necessary correctness fix rather than an optimization. Test ran on RTX 4070 12GB.

Fixes #12294

Before submitting

This PR fixes a bug in the code
This PR adds tests that verify the fix
This PR includes benchmarks demonstrating performance impact
Did you write any new necessary tests?

Who can review?

@yiyixuxu @sayakpaul - Would appreciate your review, especially regarding the benchmarking approach. I used a custom benchmark rather than BenchmarkMixin because:

This tests a specific bug fix (mask application) rather than optimization strategies
The fix uses synthetic models to isolate the mask handling logic
Standard benchmarks focus on pretrained model performance with different quantization/offloading strategies
The metrics needed are different (latency distribution, throughput) vs standard format (compile/plain time comparison)

Note: The benchmark file is named benchmarking_qwenimage_mask.py (with "benchmarking" prefix) rather than benchmark_qwenimage_mask.py to prevent it from being picked up by run_all.py, since it doesn't use BenchmarkMixin and produces a different CSV schema. If you prefer, I can adapt it to use the standard format instead.

Happy to adjust the approach if you have suggestions!

Improves attention mask handling for QwenImage transformer by: - Adding support for variable-length sequence masking - Implementing dynamic attention mask generation from encoder_hidden_states_mask - Ensuring RoPE embedding works correctly with padded sequences - Adding comprehensive test coverage for masked input scenarios Performance and flexibility benefits: - Enables more efficient processing of sequences with padding - Prevents padding tokens from contributing to attention computations - Maintains model performance with minimal overhead

Improves file naming convention for the Qwen image mask performance benchmark script Enhances code organization by using a more descriptive and consistent filename that clearly indicates the script's purpose

sayakpaul · 2025-11-20T03:58:29Z

@cdutr it's great that you have also included the benchmarking script for fullest transparency. But we can remove that from this PR and instead have that as a GitHub gist.

The benchmark numbers make sense to me. Some comments:

Could we also check the performance with torch.compile?
Could we also see some image outputs with and without the changes introduced in this PR?

Also, I think a natural next step would be see how well this performs when combined with FA varlen. WDYT?

@naykun what do you think about the changes?

HuggingFaceDocBuilderDev · 2025-11-20T03:59:53Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

cdutr · 2025-11-22T03:51:49Z

Thanks @sayakpaul! I removed the benchmark script, moved all tests to this gist.

torch.compile test

Also tested the performance with torch.compile, and results were similar, the details are below.

Tested on NVIDIA A100 80GB PCIe:

Without compile: 4.70ms per iteration
With compile:    1.93ms per iteration
Speedup:         2.43x
Compilation overhead: 7.25s (one-time cost)

Also validated on RTX 4050 6GB (laptop) with similar results (2.38x speedup). The mask implementation is fully compatible with torch.compile.

Image outputs

Tested End-to-end image generation: Successfully generated images using QwenImagePipeline and pipeline runs without errors, here is the output generated:

FA Varlen

FA varlen is the natural next step, yes! I'm interested in working on it. Should I keep iterating in this PR, or should we merge it and create a new issue?

The mask infrastructure this PR adds would translate well to varlen, instead of masking padding tokens, we'd pack only valid tokens using the same sequence length information

sayakpaul · 2025-11-22T04:05:20Z

Thanks for the results! Looks quite nice.

FA varlen is the natural next step, yes! I'm interested in working on it. Should I keep iterating in this PR, or should we merge it and create a new issue?

I think it's fine to first merge this PR and then we work on it afterwards. We're adding easier support for Sage and FA2 in this PR: #12439, so after that's merged, it will be quite easy to work on that (thanks to the kernels lib).

Could we also check if the outputs deviate with and without the masks, i.e., the outputs we get on main and this PR branch?

sayakpaul · 2025-11-22T04:08:12Z

@dxqb would you maybe interested in checking this PR out as well?

dxqb · 2025-11-22T10:16:05Z

src/diffusers/models/transformers/transformer_qwenimage.py

+            )
+
+            joint_attention_mask_1d = torch.cat([text_attention_mask, image_attention_mask], dim=1)
+            attention_mask = joint_attention_mask_1d[:, None, None, :] * joint_attention_mask_1d[:, None, :, None]


this works, but optimization possible:
this generates a real 2D mask in memory, of size seq_len * seq_len. An attention mask that is broadcastable to the required shape is enough:
attention_mask_2d = attention_mask[:, None, None, :]

Why this is enough: all tokens don't attend to the masked tokens anymore. whether the masked tokens attend to any other tokens is irrelevant, because they are masked in all layers and their result is never used.
I have tested this and the results were pixel-identical.

dxqb · 2025-11-22T10:18:19Z

src/diffusers/models/transformers/transformer_qwenimage.py

+            )
+
+            joint_attention_mask_1d = torch.cat([text_attention_mask, image_attention_mask], dim=1)
+            attention_mask = joint_attention_mask_1d[:, None, None, :] * joint_attention_mask_1d[:, None, :, None]


another optimization:
attention masking is expensive, because torch SDPA switches to a flash attention algorithm internally if there is no attention mask. it cannot do that with an attention mask.

detecting a no-op attention mask can help:
attention_mask_2d = attention_mask[:, None, None, :] if not torch.all(text_attention_mask) else None

but you could also say that you expect the caller to not pass an attention mask if it's a no-op. also valid viewpoint.

dxqb · 2025-11-22T10:25:37Z

src/diffusers/models/transformers/transformer_qwenimage.py

+                    f"must match encoder_hidden_states sequence length ({text_seq_len})"
+                )
+
+            text_attention_mask = encoder_hidden_states_mask.bool()


This works if the encoder_hidden_states_mask is already bool, or a float tensor with the same semantics.
bool attention masks are enough for the usual usecase of masking unused text tokens, but if only bool attention masks are supported this should be clearly documented. also maybe change the type hint?

see https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html how float attention masks are interpreted by torch. a float 0.0 is not masked, a bool False is masked.

there are some usecases for float attention masks for text sequences, like putting an emphasis/bias on certain tokens. not very common though, so if you decide to only support bool attention masks that makes sense to me - but requires documentation.

dxqb · 2025-11-22T10:31:36Z

src/diffusers/models/transformers/transformer_qwenimage.py

+        # Use padded sequence length for RoPE when mask is present.
+        # The attention mask will handle excluding padding tokens.
+        if encoder_hidden_states_mask is not None:
+            txt_seq_lens_for_rope = [encoder_hidden_states.shape[1]] * encoder_hidden_states.shape[0]


could you please read this #12344 (comment)
I don't think this is how the txt_seq_lens parameter was intended.

However, your change here might still be a valid (temporary) fix, because it's currently (before the PR) not used as intended either

just noticed, this was already discussed in other comments above.

cdutr added 3 commits November 13, 2025 22:50

Updates benchmark results CSV filename

0d26a8a

Renames benchmark script for clarity

2bf1622

Improves file naming convention for the Qwen image mask performance benchmark script Enhances code organization by using a more descriptive and consistent filename that clearly indicates the script's purpose

cdutr mentioned this pull request Nov 14, 2025

[Qwen-image] encoder_hidden_states_mask is not used #12294

Open

sayakpaul mentioned this pull request Nov 14, 2025

The Diffusers MVP 🚀 #12635

Open

sayakpaul requested a review from yiyixuxu November 20, 2025 03:59

Removes QwenImage mask performance benchmark script

4b31966

dxqb reviewed Nov 22, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix qwen encoder hidden states mask #12655

Fix qwen encoder hidden states mask #12655

cdutr commented Nov 14, 2025

Uh oh!

sayakpaul commented Nov 20, 2025

Uh oh!

HuggingFaceDocBuilderDev commented Nov 20, 2025

Uh oh!

cdutr commented Nov 22, 2025 •

edited

Loading

Uh oh!

sayakpaul commented Nov 22, 2025

Uh oh!

sayakpaul commented Nov 22, 2025 •

edited

Loading

Uh oh!

dxqb Nov 22, 2025

Uh oh!

dxqb Nov 22, 2025

Uh oh!

dxqb Nov 22, 2025 •

edited

Loading

Uh oh!

dxqb Nov 22, 2025

Uh oh!

dxqb Nov 22, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Fix qwen encoder hidden states mask #12655

Are you sure you want to change the base?

Fix qwen encoder hidden states mask #12655

Conversation

cdutr commented Nov 14, 2025

What does this PR do?

Changes

Impact

Benchmark Results

Before submitting

Who can review?

Uh oh!

sayakpaul commented Nov 20, 2025

Uh oh!

HuggingFaceDocBuilderDev commented Nov 20, 2025

Uh oh!

cdutr commented Nov 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

torch.compile test

Image outputs

FA Varlen

Uh oh!

sayakpaul commented Nov 22, 2025

Uh oh!

sayakpaul commented Nov 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dxqb Nov 22, 2025

Choose a reason for hiding this comment

Uh oh!

dxqb Nov 22, 2025

Choose a reason for hiding this comment

Uh oh!

dxqb Nov 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dxqb Nov 22, 2025

Choose a reason for hiding this comment

Uh oh!

dxqb Nov 22, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

cdutr commented Nov 22, 2025 •

edited

Loading

sayakpaul commented Nov 22, 2025 •

edited

Loading

dxqb Nov 22, 2025 •

edited

Loading